Sunburst #This is an exemple of a sunburst plot
Parasite of ruminants cause diseases of major socio-economic importance worldwide.The Gastrointestinal Nematodes (GIN), are currently the most important disease problem of sheep worldwide causing reductionin meat, milk, and fibre production. GIN are rapidly becoming more and more resistant to all the different anthelmintic drug treatments, which is why analysing parasitic GIN community structure is essential for diagnostics, surveillance and research.
This is why the Nemabiome deep amplicon sequencing targeting ITS-2 (Internal Transcribed Spacer 2) region of nematode nuclear ribosomal DNA has been developed. This next generation sequencing provides a picture of the species composition of the GIN parasite community structure in small and large sample sets. The Nemabiome sequencing pipeline gives us the results condensed into a twelves sheet excel file which indicates IdTaxa (60 and 30) results summed up by species or by ASV.
IdTaxa is a taxonomic classification tool to classify rRNA or ITS sequences. ASV or Amplicon sequence variant refers to individual DNA sequences recovered using a high-throughput marker gene analysis when false sequences are removed after sequencing. The excel file also contains the results obtained from BLAST. BLAST or Basic Local Alignment Search Tool is a program which finds regions of local similarity between sequences by comparing either nucleotide or protein sequences to sequence databases in order to calculate statistical significance. Unfortunately, the results obtained in those different excel sheet are not simple to read and to analyse and the most compelling information’s might be lost without proper data visualization. Using the sequencing data output, we want to be able to understand the different classification algorithms used BLAST and IdTaxa and highlight various piece of information’s emerging from these classifications. We will be using interactive plots as an additional way to highlight any classification discrepancies which are hard to visualize by only looking at the excel spreadsheets.
Sunburst plot (exemple above) is a tool to enable visualisation of hierarchal data spanning outward radially from root to leaves. In this tutorial we will use this tool to visualize the Taxonomic classification of the gastrointestinal nematodes found in our samples, using the Nemabiome sequencing data.
The Walkthrough is using as an example a dataset of 10 deer samples
Ensure you know where all the following files are located on your computer (file directories):
You will need the FinalSummary.xlsx excel file obtained after running the Nemabiome pipeline. The “IdTaxa60Classification” tab will be specifically used.
The first step is to install the following packages. These packages are available in the Comprehensive R Archive Network (CRAN) repository and can be installed via install.packages(). By default, the latest version available will be installed.
dplyr package: Commonly use to modify data frame like objects, both in memory and out of memory by providing set of verbs such as select(), summarize() or group_by().
ggplot2 package: Used for the creation of graphics using “The Grammar of Graphics”. This package takes in consideration the variables mapped to its aesthetics function as well as the graphical function chosen and creates a graph.
openxlsx package: Provides an easier way to interact with Excel.xlsx files. Allows for the creation, styling and editing of excel spreadsheets
plotly package: Turns the graphics designed using ggplot2 into interactive web graphics.
reshape2 package: Used to restructure and aggregate data with function such as melt()
tidyr package: Used to tidy data sets using functions such as gather()
stringr package: Used for common string operations. This package contains a cohesive set of functions designed to work with strings in a simple manner
##remove the # to install, leave the # if already installed
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("openxlsx")
#install.packages("plotly")
#install.packages("reshape2")
#install.packages("tidyr")
#install.packages("stringr")
With the packages now installed, you can load them into R (or Rstudio) via the library() function.
## Load packages into R (Remove # to load)
#library(openxlsx)
#library(reshape2)
#library(ggplot2)
#library(plotly)
#library(dplyr)
#library(tidyr)
#library(stringr)
Here, you are telling R where your samples are located. This directory will need to be changed depending on where your samples are on the server/locally. The Variables to be Changed tab contains all directory changes in a more concise manner.
Setting Path to Samples: R needs to know where your samples are
My_Data <- read.xlsx("/Users/eleonorecharrier/Desktop/CODE/Summary.xlsx", sheet = "IdTaxa60Classification")
This step allows to format species colour into an object which can be reuse later on for graph aesthetic purposes.
https://www.rapidtables.com/web/color/RGB_Color.html
This link will provide easy hexadecimal colours code. You can pick your colour and format species colour. If you need to add new species, you can add another line as such: “genus_species” = “#hexadecimalcode”
## This first colour object will be later on used in the aesthetic for the stackbar plot
manualcolorSb <- c(
"NA" = 'black',
"Strongylida" = 'pink',
"Haemonchidae" = 'palevioletred',
"Chabertiidae" = '#ef8a62',
"Molineidae" = 'lightcoral',
"Spiculopteragia" = '#a86c80',
"Ostertagia" = '#a33358',
"Haemonchus" = '#a85470',
"Oesophagostomum" = '#fc8d62',
"Nematodirus" ='#fbb4ae',
"Spiculopteragia_boehmi" = '#825564',
"unclassified_Spiculopteragia" = '#80666e',
"unclassified_Ostertagia" = '#c23e6a',
"Haemonchus_placei" = '#e9a3c9',
"unclassified_Haemonchus" = '#d47f9b',
"Oesophagostomum_venulosum" = '#d8b365',
"Nematodirus_belvetianus" = '#ef8a62'
)
This step will allow the creation of custom functions which will be later of use for the construction of sunburst plot. We are creating these custom functions as a way to simplify our code as well as a way to make it more understandable. Those custom functions will also allow the user to have a code frame where only their data input will be changed. This will make the code more stable and reliable.
The first custom function will render possible the gathering of column of interest from the mother file of choice. This function will allow the user to replace both of those by: attributing the column of interest to “To_Replace” and the mother dataframe of choice to “Data_Of_Choice”. This function use the gather() function to collect a set of wanted column names and places them into a single “key” column. We use this function twice. A first time, to gather the column samples of choice into a single column called “Taxonomy_Value”, and a second time to gather the number of reads per “Taxonomy_Value”. We then use the group_by() to group our two columns by the “Taxonomy_Value” column. Finally, the summarise() on the grouped data previously created by the group_by() function. This function applied to the number of reads per species will create one row for each “Taxonomy_Value”. If ordered into a new object, this function will create a new data frame containing only the number of reads per taxonomy_value (species).
Function_Gather <-function(To_Replace, Data_Of_Choice) {
To_Replace <- enquo(To_Replace)
Data_Of_Choice %>%
gather(key = "Sample", value = "Read", !!To_Replace) %>%
gather(key = "taxonomy", value = "Taxonomy_Value", order:species) %>%
group_by(Taxonomy_Value) %>%
summarise(Total = sum(Read))
}
The second custom function will arrange a data frame to be assigned to “Name_Arrange” in a custom order. The arrange() and match() functions are used in this custom function to reorder the rows by one or more variable from the wanted column “Column_Of_Interest” which will be attributed to the object “Name_Taxonomy_Replace”. The custom order of the rows is the order in which your classification is in the “Labels” column of the “Sunburst Table” tab of the walkthrough.
Function_Arrange <- function(Name_Arrange, Column_Of_Interest ,Name_Taxonomy_Replace){
Name_Arrange %>%
arrange(match(Column_Of_Interest, c(Name_Taxonomy_Replace)))
}
The third custom function has for aim to create a data frame which can be use to create a Sunburt plot. In order to do so, we use the data.frame() function to create a dataframe with the different parameters needed for the construction of Sunburst plot.
The sunburst sectors are determined by the entries in “ids”, “labels” and “parents”. - ids: This will assign id labels to each datum. These ids will be used as an object constancy of the different data points during the sunburst animation. They should only be an array of strings. - labels: This will enable us to set the labels for each of the different sectors. See “Labels” column of the “Sunburst Table” tab of the walkthrough. - parents: This will set the parent sectors for each of the different created sectors. The use of empty string items ’’ will be understood as a way to reference the root node in the hierarchy. In this case, as the “ids” are filled, the “parents” items are understood to also be “ids” themselves. See “Parents” column of the “Sunburst Table” tab of the walkthrough. - size: This will attribute the size of each of the different sectors.
Function_DataFrame_Sb <- function(Data_Arrange_Taxonomy, Label_Order, Parent_Order, Data_Arrange_Total){
data.frame(
ids = c(Data_Arrange_Taxonomy) ,
labels = c(Label_Order),
parents = c(Parent_Order),
size = Data_Arrange_Total,
stringsAsFactors = FALSE)
}
The fourth and final custom function will be used to create and allow the Sunburst plot to be interactive. In order to do so, we use the plot_ly() function. the hovertext is attributed the data we want to appear as we hover the different sectors of the sunburst plot.
#This function will allow for the creation of the Sunburst Plot
Function_Sb <- function(Sunburst_Dataset, Dataset_Arrange_Total){
plot_ly(Sunburst_Dataset,
ids = ~ids,
labels = ~labels,
parents = ~parents,
type = 'sunburst',
hovertext = ~Dataset_Arrange_Total,
marker = list(colors = manualcolorSb))
}
This table is an example of the table created to understand the order of the sunburst more easily. This table is based on our example with 10 deer samples. This table will vary with each set of data.
Creating this sunburst will allow to more easily understand the taxonomic relationship between the different gastrointestinal species population found within our samples.
First, we need to use the read.xlsx() function to read specifically the IdTaxa60Classification sheet in the sequencing excel spreadsheet results. We will use the IdTaxa 60 classification results as it represents a confidence threshold of 60% which is higher than the IdTaxa 30 classification results, thus more robust. We also look at its structure to understand and look if changes need to be made.
My_Data <- read.xlsx("/Users/eleonorecharrier/Desktop/CODE/Summary.xlsx", sheet = "IdTaxa60Classification") #!!The first object of this function needs to be changed to wherever the `FinalSummary.xlxs` file is in your working directory
After looking at the structure of our data frame “My_Data”, we need to create a new data frame containing only the column which will be of our interest in the creation of the Sunburst plot.
We will use the custom function “Function_Gather” create above. This allows us to only select the column 193_S193_L001_R1_001.fastq.gzto the column273_S273_L001_R1_001.fastq.gz (includes all the fasta.gz column of the FinalSummary.xlxs file) and consider them as ‘Read’ values. This function gives us a new data frame “New_Data” grouped by taxonomy_value with the number of reads for each of these values.
New_Data <- Function_Gather(`193_S193_L001_R1_001.fastq.gz`:`273_S273_L001_R1_001.fastq.gz`, My_Data)
We then need to define NA as a string character “Unknown” in the Taxonomy_Value column. This will be needed for the next step of rearranging the data is a custom order. Our data is a data frame thus, a named list giving the value, here “Unknown” is used to replace NA for each column using the replace_na() function.
str(New_Data) #Check if the the colunm of interest were corectly selected
## Classes 'tbl_df', 'tbl' and 'data.frame': 17 obs. of 2 variables:
## $ Taxonomy_Value: chr "Chabertiidae" "Haemonchidae" "Haemonchus" "Haemonchus_placei" ...
## $ Total : num 4409 230883 47433 42715 18953 ...
New_Data[17,1]="NA"
New_Data %>% replace_na(list(x = 0, y = "NA"))
## # A tibble: 17 x 2
## Taxonomy_Value Total
## <chr> <dbl>
## 1 Chabertiidae 4409
## 2 Haemonchidae 230883
## 3 Haemonchus 47433
## 4 Haemonchus_placei 42715
## 5 Molineidae 18953
## 6 Nematodirus 18953
## 7 Nematodirus_helvetianus 18953
## 8 Oesophagostomum 4409
## 9 Oesophagostomum_venulosum 4409
## 10 Ostertagia 43716
## 11 Spiculopteragia 139734
## 12 Spiculopteragia_boehmi 48385
## 13 Strongylida 254245
## 14 unclassified_Haemonchus 4718
## 15 unclassified_Ostertagia 43716
## 16 unclassified_Spiculopteragia 91349
## 17 NA 19368
This is the last step in rearranging our new data frame. We now have our data in two column, one for the taxonomy classification and one for the number of reads allocated to each level of the taxonomy classification. The last step is to arrange the data in the custom order needed for the organisation of the sunburst plot. To do so, we will use the custom function “Function_Arrange” create above.
New_Data_Arrange <- Function_Arrange(New_Data, New_Data$Taxonomy_Value ,c("NA", "Strongylida", "Haemonchidae", "Chabertiidae", "Molineidae", "Spiculopteragia", "Ostertagia", "Haemonchus", "Oesophagostomum", "Nematodirus", "Spiculopteragia_boehmi", "unclassified_Spiculopteragia", "unclassified_Ostertagia", "Haemonchus_placei", "unclassified_Haemonchus", "Oesophagostomum_venulosum", "Nematodirus_belvetianus"))
Using the custom function “Function_DataFrame_Sb” created above, we assign the ids, labels, parents and size to our sunburst plot.
DF_Sb1 <- Function_DataFrame_Sb(c(New_Data_Arrange$Taxonomy_Value),
c("NA", "Strongylida", "Haemonchidae", "Chabertiidae", "Molineidae", "Spiculopteragia", "Ostertagia", "Haemonchus", "Oesophagostomum", "Nematodirus", "Spiculopteragia<br>boehmi", "unclassified<br>Spiculopteragia", "unclassified<br>Ostertagia", "Haemonchus<br>placei", "unclassified<br>Haemonchus", "Oesophagostomum<br>venulosum", "Nematodirus<br>belvetianus"),
c("","", "Strongylida","Strongylida","Strongylida", "Haemonchidae", "Haemonchidae", "Haemonchidae","Chabertiidae", "Molineidae", "Spiculopteragia", "Spiculopteragia", "Ostertagia", "Haemonchus", "Haemonchus","Oesophagostomum", "Nematodirus"),
New_Data_Arrange$Total)
Finally, using the custom function “Function_Sb” created above we are constructing the sunburst plot of the “NUMBERS OF READS PER LEVEL OF TAXONOMY WITHIN YOUR SAMPLE”. It is an interactive plot, thus is you over each section, it will inform you on the number of reads for each taxa. In addition, by clicking on a specific taxa you can zoom the sunburst on that specific taxa.
Sb1 <- Function_Sb(DF_Sb1,New_Data_Arrange$Total)
Sb1 ##To visualize our sunburst plot
!The bold part are to be changed in your code (TO CHANGE)!
My_Data <- read.xlsx(TO CHANGE, sheet = "IdTaxa60Classification")
New_Data <- Function_Gather(TO CHANGE, My_Data)
New_Data[TO CHANGE,1]="NA"
New_Data_Arrange <- Function_Arrange(New_Data, New_Data$Taxonomy_Value ,c(TO CHANGE LABEL)
DF_Sb1 <- Function_DataFrame_Sb(c(New_Data_Arrange&Taxonomy_Value),
c(TO CHANGE LABEL),
c(TO CHANGE PARENT),
New_Data_Arrange$Total)
LINK TO THE DROPBOX FOLDER CONTAINING ALL RELEVANT FILES
This link will bring you to the dropbox folder containing all files mentioned within the walkthrough. This includes R files, relevant excel files, etc:
https://www.dropbox.com/sh/d19n2j1h27xvlqn/AACl14bR-wm5MPtU5XQ54luta?dl=0
NEMABIOME PIPELINE
. Nemabiome Pipeline walkthrough: https://jilldr.github.io/Dada2PipelineOperatingProcedure/#walkthrough
. Nemabiome Files: https://www.dropbox.com/sh/gks063m19qud9lx/AAC8w5bWWro22C8GAZ9MNZfTa?dl=0
GENERAL DOCUMENTATION
. https://www.rdocumentation.org
. Plotly: https://www.rdocumentation.org/packages/plotly/versions/4.9.2/topics/plot_ly
. Sunburst: https://plotly.com/r/sunburst-charts/